A1/A2 Recap

A1

Library Preperation

We will start by checking if the required libraries and files are installed. Then we will load the libraries using library().

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
if (!requireNamespace("GEOmetadb", quietly = TRUE)){
  BiocManager::install("GEOmetadb")
}

if (!file.exists('GEOmetadb.sqlite')){
  getSQLiteFile()
}

if (!requireNamespace("biomaRt", quietly = TRUE)){
  BiocManager::install("biomaRt")
}

if (!requireNamespace("edgeR", quietly = TRUE)){
  BiocManager::install("edgeR")
}

library(BiocManager)
library(edgeR)
## Warning: package 'edgeR' was built under R version 4.1.1
library(GEOmetadb)
## Warning: package 'RSQLite' was built under R version 4.1.2
library(biomaRt)
## Warning: package 'biomaRt' was built under R version 4.1.1
library(knitr)
## Warning: package 'knitr' was built under R version 4.1.2

Data Selection

Now with all libraries installed, we will use the following commands to query the GEOmetadb (GEO database) to look for a dataset with a “raw counts” txt-supplmentary file. Note we are specifically querying datasets with submission dates > 2015-01-01, related to cancer and high throughput sequencing data of Homo sapiens.

sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             "  gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             "  JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             "  gse.submission_date > '2015-01-01' AND",
             "  gse.title LIKE '%cancer%' AND", 
             "  gpl.organism LIKE '%Homo sapiens%' AND",
             "  gpl.technology LIKE '%high-throughput seq%' ",
             "  ORDER BY gse.submission_date DESC",sep=" ")
con <- dbConnect(SQLite(), 'GEOmetadb.sqlite')
rs <- dbGetQuery(con, sql)
unlist(lapply(rs$supplementary_file,
              FUN = function(x){x <- unlist(strsplit(x,";")) ;
              x <- x[grep(x,pattern="txt",ignore.case = TRUE)];
              tail(unlist(strsplit(x,"/")),n=1)})) [1:30]
##  [1] "GSE166697_Urinary_miRNA_counts.txt.gz"                                 
##  [2] "GSE166417_gene_expression.txt.gz"                                      
##  [3] "GSE165883_2019.01.15.processed.cpm.txt.gz"                             
##  [4] "GSE165842_vst.txt.gz"                                                  
##  [5] "GSE165452_ACY241_PRL_RNAseq_no_rRNA_Kallisto_Gene_Counts_matrix.txt.gz"
##  [6] "GSE165247_RawCounts_Matrix_2019.txt.gz"                                
##  [7] "GSE165115_Transcriptome_counts.txt.gz"                                 
##  [8] "GSE165115_Transcriptome_counts.txt.gz"                                 
##  [9] "GSE164730_expression_profile.txt.gz"                                   
## [10] "GSE164531_NEDD9_count_table.txt.gz"                                    
## [11] "GSE164169_T47DZOE_T47DPCDH.anno.txt.gz"                                
## [12] "GSE163366_otu_tax.txt.gz"                                              
## [13] "GSE163374_rat.raw.counts.txt.gz"                                       
## [14] "GSE162737_TE-1_mRNA_transcripts.FPKM.txt.gz"                           
## [15] "GSE162726_barcode_supplementary.txt.gz"                                
## [16] "GSE162515_RNAseq_rawCounts.txt.gz"                                     
## [17] "GSE162564_22RV1_CountTable.txt.gz"                                     
## [18] "GSE162353_S100A9_MDSC_RNAseq.txt.gz"                                   
## [19] "GSE162285_gene_raw_counts_matrix.txt.gz"                               
## [20] "GSE162215_all.fpkm.txt.gz"                                             
## [21] "GSE162104_count.txt.gz"                                                
## [22] "GSE162004_gene_expression.txt.gz"                                      
## [23] "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz"         
## [24] "GSE161509_all_samples.fragments_per_gene.txt.gz"                       
## [25] "GSE161502_Raw_Counts.txt.gz"                                           
## [26] "GSE161349_Counts_noadj_exons_condense.txt.gz"                          
## [27] "GSE161243_compiled_counts.txt.gz"                                      
## [28] "GSE160796_gene_fpkm.txt.gz"                                            
## [29] "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"                        
## [30] "GSE160723_read_counts.txt.gz"

The chosen dataset is “GSE164531” which is an RNA-seq dataset using the Illumina Hiseq 2000 platform for identification of NEDD9 regulated genes in VCaP prosatate cancer cells. We download the counts data using the following commands and check the dimensions of the data.

We then load the counts data into R and check the dimensions.

data <- read.delim(fnames[1], header = TRUE, check.names = FALSE)
dim(data)
## [1] 58736     8

Then we quickly check the head of the data (first 10 rows) to explore further.

kable(head(data), format = 'html', caption = "Table 0.1: GSE164531 Dataset")
Table 0.1: GSE164531 Dataset
EnsemblID GeneSymbol shNEDD9_1_S10 shNEDD9_2_S11 shNEDD9_3_S12 shNTC_1_S7 shNTC_2_S8 shNTC_3_S9
ENSG00000223972 DDX11L1 0 0 2 0 0 0
ENSG00000227232 WASH7P 0 0 0 0 0 0
ENSG00000278267 MIR6859-1 0 0 0 0 0 0
ENSG00000243485 MIR1302-2HG 0 0 0 0 0 0
ENSG00000284332 MIR1302-2 0 0 0 0 0 0
ENSG00000237613 FAM138A 0 0 0 0 0 0

Initial Exploratory Analysis

Duplicate Genes

The first exploratory analysis is checking for duplicate genes in our data. We run the following commands and find that there are no duplicate genes present in the data. Resulting in no further actions required to deal with duplicate genes.

summarized_gene_counts <- sort(table(data$EnsemblID),decreasing = TRUE)
kable(summarized_gene_counts[which(summarized_gene_counts > 1)[1:10]], format = 'html', caption = "Table 0.2: Duplicate genes and it's frequencies")
Table 0.2: Duplicate genes and it’s frequencies
Var1 Freq
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA

Missing Data

Now we check for any missing data in the dataset.

# Check which rows have missing data
which(!complete.cases(data))
## [1] 58736

We see that row 58736 has missing data, we access this row to explore further.

data[58736,]

Seems like this row is just an empty row with no meaningful data. As it contains no data, we can safely remove the row without affecting the data-set (We remove the row as it N/A values may interfere with next steps such as normalization).

data <- data[complete.cases(data),]
dim(data)
## [1] 58735     8

Grouping the Data

We will define groups for the data according to each biological condition (experiment design) for later steps such as normalization.

samples <- data.frame(lapply(colnames(data)[3:8], FUN=function(x){unlist(strsplit(x, split = "_"))[c(1, 2)]}))
colnames(samples) <- colnames(data)[3:8]
rownames(samples) <- c("condition", "replicate")
samples <- data.frame(t(samples))
samples

Filtering Data

We now filter out low counts from the data according to the edgeR protocol. The threshold is set to 3 because edgeR recommends the threshold to be the number of replications which in our data set is 3 (Would be interesting to see what happens if we change our threshold, but we will keep it to 3 for now).

cpms <- edgeR::cpm(data[, 3:8])
rownames(cpms) <- data[, 1]
# Threshold set to 3 as recommended by edgeR protocol
keep <- rowSums(cpms > 1) >= 3
dataFiltered <- data[keep, ]
head(dataFiltered)

We quickly compare the dimensions of the filtered data and original data.

dim(data)
## [1] 58735     8
kable(head(dataFiltered), format = 'html', caption = "Table 0.3: Filtered Data")
Table 0.3: Filtered Data
EnsemblID GeneSymbol shNEDD9_1_S10 shNEDD9_2_S11 shNEDD9_3_S12 shNTC_1_S7 shNTC_2_S8 shNTC_3_S9
20 ENSG00000279457 FO538757 6 12 12 7 10 9
34 ENSG00000225630 MTND2P28 264 298 417 365 153 310
35 ENSG00000237973 MTCO1P12 25 34 36 46 23 56
39 ENSG00000248527 MTATP6P1 835 1060 1291 1219 525 1166
44 ENSG00000228327 AL669831 14 16 22 6 3 7
51 ENSG00000228794 LINC01128 248 300 368 287 146 239

We notice that the total number of genes reduced to 14682 from 58735 after filtering out low expression data using the edgeR filtering protocol.

Mapping Data

Now we will map the filtered data to HUGO gene symbols using grch38.p13 and the biomaRt package. The following commands creates a ensembl_gene_id and HUGO gene symbol conversion table which we will use to merge with the expression data after normalization.

ensembl <- useEnsembl("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
conversionStash <- "data_conversion.rds"
if(file.exists(conversionStash)){
  dataIdConversion <- readRDS(conversionStash)
} else {
  dataIdConversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                               filters = c("ensembl_gene_id"),
                               values = dataFiltered$EnsemblID,
                               mart = ensembl)
  saveRDS(dataIdConversion, conversionStash)
}

kable(head(dataIdConversion), format = 'html', caption = "Table 0.4: Ensembl Gene Id to hgnc symbol")
Table 0.4: Ensembl Gene Id to hgnc symbol
ensembl_gene_id hgnc_symbol
ENSG00000000457 SCYL3
ENSG00000000460 C1orf112
ENSG00000001084 GCLC
ENSG00000001167 NFYA
ENSG00000001460 STPG1
ENSG00000001461 NIPAL3

Normalization

We will now normalize our data using the TMM method using the following commands.

filteredDataMatrix <- as.matrix(dataFiltered[,3:8])
rownames(filteredDataMatrix) <- dataFiltered$EnsemblID
d = DGEList(counts = filteredDataMatrix, group = samples$condition)
d = calcNormFactors(d)
normalizedCounts <- cpm(d)

Now we will merge the normalizedCounts with our identifiers that we mapped in our previous step. Now we have a dataFrame with mapped ensembl_gene_id, HUGO gene symbols and normalized expression data.

dataFilteredAnnot <- merge(dataIdConversion, normalizedCounts, by.x = 1, by.y = 0, all.y=TRUE)
kable(dataFilteredAnnot[1:5,1:8],type = "html", caption = "Table 0.5: Normalized Data")
Table 0.5: Normalized Data
ensembl_gene_id hgnc_symbol shNEDD9_1_S10 shNEDD9_2_S11 shNEDD9_3_S12 shNTC_1_S7 shNTC_2_S8 shNTC_3_S9
ENSG00000000003 TSPAN6 8.056319 10.05134 9.355297 14.791616 13.169145 13.448793
ENSG00000000419 DPM1 63.443514 59.16293 58.143675 41.819933 35.431272 42.222956
ENSG00000000457 SCYL3 35.078557 32.82588 33.699189 20.842732 22.889229 22.518910
ENSG00000000460 C1orf112 24.168958 20.99330 21.829027 8.068154 5.330368 8.757354
ENSG00000001036 FUCA2 76.367193 69.34149 69.510864 52.039594 62.710216 52.387741

Pre-normalization Density vs Post-normalization Density

The density plots for pre-normalization and post-normalization data is plotted for comparison.

Pre-normalization:
pre_data2plot <- log2(cpm(dataFiltered[, 3:8]))
pre_counts_density <- apply(log2(cpm(dataFiltered[, 3:8])), 2, density)
xlim <- 0
ylim <- 0
for (i in 1:length(pre_counts_density)) {
  xlim <- range(c(xlim, pre_counts_density[[i]]$x)); 
  ylim <- range(c(ylim, pre_counts_density[[i]]$y))
  }
cols <- rainbow(length(pre_counts_density))
ltys <- rep(1, length(pre_counts_density))
#plot the first density plot to initialize the plot
plot(pre_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(pre_counts_density)) lines(pre_counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(pre_data2plot),  
       col=cols, lty=ltys, cex=0.75,
       border ="blue",  text.col = "green4",
       merge = TRUE, bg = "gray90")
Fig. 0.1: Density Plot for Pre-normalization data

Fig. 0.1: Density Plot for Pre-normalization data


Post-normalization:
post_data2plot <- log2(cpm(dataFilteredAnnot[, 3:8]))
post_counts_density <- apply(log2(cpm(dataFilteredAnnot[, 3:8])), 2, density)
xlim <- 0
ylim <- 0
for (i in 1:length(post_counts_density)) {
  xlim <- range(c(xlim, post_counts_density[[i]]$x)) 
  ylim <- range(c(ylim, post_counts_density[[i]]$y))
  }
cols <- rainbow(length(post_counts_density))
ltys <- rep(1, length(post_counts_density))
#plot the first density plot to initialize the plot
plot(post_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(post_counts_density)) lines(post_counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(post_data2plot),  
       col=cols, lty=ltys, cex=0.75, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")
Fig. 0.2: Density Plot for Post-normalization data

Fig. 0.2: Density Plot for Post-normalization data


Notice no significance difference between pre and post normalization. I believe this is due to the fact that my original data happens to be close to a normal distribution, thus normalization with TMM method did not make any significant changes.

Missing Identifiers

We will now deal with missing identifiers (missing HUGO gene symbols). First, we will check how many missing identifiers are present in our data right now.

ensembl_id_missing_gene <- dataFilteredAnnot$ensembl_gene_id[which((dataFilteredAnnot$hgnc_symbol) == "")]
length(ensembl_id_missing_gene)
## [1] 732

We see that we have 732 missing identifiers, lets check out some of these rows with missing identifiers.

kable(dataFilteredAnnot[which((dataFilteredAnnot$hgnc_symbol == ""))[1:5],1:8], type="html", caption = "Table 0.6: Genes with missing identifiers")
Table 0.6: Genes with missing identifiers
ensembl_gene_id hgnc_symbol shNEDD9_1_S10 shNEDD9_2_S11 shNEDD9_3_S12 shNTC_1_S7 shNTC_2_S8 shNTC_3_S9
3142 ENSG00000111788 4.69952 4.834820 4.526757 1.7481001 1.8813065 2.1893384
8607 ENSG00000165121 1.34272 2.544642 1.710108 2.8238540 2.8219597 3.1276263
9080 ENSG00000167912 2.34976 1.017857 1.609514 0.6723462 0.3135511 0.9382879
9416 ENSG00000170089 12.08448 10.178568 11.266595 6.3200541 7.8387770 5.9424901
9425 ENSG00000170161 1.00704 1.017857 1.307730 0.9412847 0.9406532 0.1563813

As the missing identifiers are only ~5% of the data set, we will keep these rows for now and continue with our analysis. Also, we have knowledge of the rows with missing identifiers which will allow us to remove/modify how we handle these missing identifiers easily later on in our analysis and removing these data at this step does not seem the best practice (would be interesting to see how these genes with missing identifiers result in our final analysis).

Final Data

Now we present the final data set that has been filtered, normalized and mapped to HUGO gene symbols with 14683 genes (rows).

kable(head(dataFilteredAnnot), format = "html", caption = "Table 0.7: Final Dataset")
Table 0.7: Final Dataset
ensembl_gene_id hgnc_symbol shNEDD9_1_S10 shNEDD9_2_S11 shNEDD9_3_S12 shNTC_1_S7 shNTC_2_S8 shNTC_3_S9
ENSG00000000003 TSPAN6 8.056319 10.05134 9.355297 14.791616 13.169145 13.448793
ENSG00000000419 DPM1 63.443514 59.16293 58.143675 41.819933 35.431272 42.222956
ENSG00000000457 SCYL3 35.078557 32.82588 33.699189 20.842732 22.889229 22.518910
ENSG00000000460 C1orf112 24.168958 20.99330 21.829027 8.068154 5.330368 8.757354
ENSG00000001036 FUCA2 76.367193 69.34149 69.510864 52.039594 62.710216 52.387741
ENSG00000001084 GCLC 50.519835 58.14507 59.250216 96.414442 91.556915 94.610697
dim(dataFilteredAnnot)
## [1] 14684     8

A2

Differential Gene Expression

Adjustments from A1

Duplicate Genes

First we will remove duplicate genes as I have made a mistake in this step for A1. We see that there are two duplicate gene_ids each with a frequency of 2.

summarizedGeneCounts <- sort(table(dataFilteredAnnot$ensembl_gene_id), decreasing = TRUE)
kable(summarizedGeneCounts[which(summarizedGeneCounts > 1)[1:5]], format = 'html', caption = "Table 1: Duplicate Genes and it's Frequencies.")
Table 1: Duplicate Genes and it’s Frequencies.
Var1 Freq
ENSG00000230417 2
ENSG00000254876 2
NA NA
NA NA
NA NA


We will then remove 1 row for each duplicate ID as it contains exactly the same data.

dataFilteredAnnot <- dataFilteredAnnot[-c(12987, 13672),]
MDS Plot

Then we will check the MDS plots to decide factors in our model as I have not included a MDS plot in A1.

The MDS plot colored by condition (red for shNEDD9 and blue for shNTC) show that expression depends heavily on the experiment condition (execpt shNTC_2 which seems to be an outlier).

plotMDS(d, labels = rownames(samples), col = c("red", "blue")[factor(samples$condition)])
legend(-1.5, 0.4, legend=c("shNEDD9", "shNTC"), fill = c("red", "blue"), cex = 0.6)
Fig. 1: MDS plot coloured by experimental condition (shNEDD9 vs shNTC)

Fig. 1: MDS plot coloured by experimental condition (shNEDD9 vs shNTC)


As we can see the groups are clustered based on cell type except shNTC_2, we will create a design matrix with samples$condition. Then we will create an expression matrix.

Calculate P-value: Limma Method

We first design a model with only condition as factor since our MDS plot shows that condition is the most significant factor.

modelDesign <- model.matrix( ~ samples$condition)
kable(modelDesign[1:6,], type="html", caption = "Table 2: Model Design Matrix with only condition as factor")
Table 2: Model Design Matrix with only condition as factor
(Intercept) samples$conditionshNTC
1 0
1 0
1 0
1 1
1 1
1 1


Then we create an expression matrix and fit our model to get our p-values as shown in the below table.

library(limma)
expressionMatrix <- as.matrix(dataFilteredAnnot[, 3:8])
rownames(expressionMatrix) <- dataFilteredAnnot$ensembl_gene_id
colnames(expressionMatrix) <- colnames(dataFilteredAnnot[3:8])
minimalSet <- ExpressionSet(assayData = expressionMatrix)
fit <- lmFit(minimalSet, modelDesign)
fit2 <- eBayes(fit, trend = TRUE)
topfit <- topTable(fit2, coef = ncol(modelDesign), adjust.method = "BH", number = nrow(expressionMatrix))
outputHits <- merge(dataFilteredAnnot[, 1:2], topfit, by.y= 0, by.x = 1, all.y = TRUE)
kable(outputHits[1:10, 1:8], type = "html", row.names = FALSE, caption = "Table 3: Results from Limma method for differential analysis")
Table 3: Results from Limma method for differential analysis
ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
ENSG00000000003 TSPAN6 4.648867 11.478751 4.873394 0.0002057 0.0004888 -0.0602216
ENSG00000000419 DPM1 -20.425318 50.037379 -7.595332 0.0000017 0.0000063 2.5506801
ENSG00000000457 SCYL3 -11.784252 27.975749 -6.234617 0.0000164 0.0000492 1.4054312
ENSG00000000460 C1orf112 -14.945135 14.857859 -11.479847 0.0000000 0.0000001 4.5631682
ENSG00000001036 FUCA2 -16.027333 63.726184 -4.986969 0.0001650 0.0003994 0.0749498
ENSG00000001084 GCLC 38.222312 75.082862 13.108581 0.0000000 0.0000000 5.0594782
ENSG00000001167 NFYA 12.468577 57.495405 5.076752 0.0001388 0.0003415 0.1801637
ENSG00000001460 STPG1 1.861078 9.591631 1.898136 0.0771967 0.1076563 -4.0498305
ENSG00000001461 NIPAL3 118.396417 168.288342 34.909691 0.0000000 0.0000000 6.8356876
ENSG00000001497 LAS1L -29.382631 76.381169 -10.735174 0.0000000 0.0000001 4.2829923


From the limma method, we see that 10108 genes are significantly deferentially expressed with p-value threshold of 0.05. The threshold of 0.05 was chosen because it is considered the standard (to my knowledge). Eventhough we ended up with more genes passing than expected with the 0.05 p-value threshold, we are keeping the threshold to avoid a form of “p-hacking” (changing the threshold only to reduce the number of genes passing).

length(which(outputHits$P.Value < 0.05))
## [1] 10113


We also see 9728 genes passing correction using the Benjamni - hochberg method with threshold of 0.05. The Benjamni - hochberg method was chosen as I remember it being mentioned as a standard method for RNA seq differential expression analysis from class (correct me if I am wrong here). Also, the threshold of 0.05 was chosen again to be consistant and seems to be the standard threshold in most cases.

length(which(outputHits$adj.P.Val < 0.05))
## [1] 9732

Calculate P-value: EdgeR Quasi liklihood Method

Now we will use the Quasi likelihood of the EdgeR package to calculate p-values as the following:

d <- DGEList(counts=expressionMatrix, group=samples$condition)
d <- estimateDisp(d, modelDesign)
fit <- glmQLFit(d, modelDesign)
r <- glmQLFTest(fit)
qlfOutputHits <- topTags(r, n = nrow(dataFilteredAnnot), adjust.method = "BH")
kable(qlfOutputHits$table[1:10, 1:5], type = "html", row.names = FALSE, caption = "Table 4: Results from Quasi liklihood method for differential analysis")
Table 4: Results from Quasi liklihood method for differential analysis
logFC logCPM F PValue FDR
3.945787 12.193073 26384.356 0 0
2.894103 10.991419 15600.934 0 0
3.872776 10.003619 15287.730 0 0
-2.398194 10.114546 9561.174 0 0
-4.160010 8.748915 8437.388 0 0
-2.577343 9.624646 8436.453 0 0
3.218526 8.940245 8399.667 0 0
1.983406 10.095593 6811.047 0 0
3.004369 8.754974 6793.744 0 0
4.897888 7.848513 6517.317 0 0

From the Quasi likelihood method, we see that 6336 genes are significantly deferentially expressed with p-value threshold of 0.05. Again, we are using a p-value threshold of 0.05 as it seems to be the standard for these differential analysis.

length(which(qlfOutputHits$table$PValue < 0.05))
## [1] 6336


We also see 5354 genes passing correction using the Benjamni - hochberg method with threshold of 0.05. Same reason as the limma method, threshold of 0.05 and Quasi likelihood method were selected as these two were considered “standards” during class.

length(which(qlfOutputHits$table$FDR < 0.05))
## [1] 5354


Volcano Plot of Differentially Expressed Genes (From Quasi Likelihood Method)

We will continue our analysis from results of the Quasi Likelihood Method as “Limma guide direct users to use edgeR up to the point of calculating differential expression.” I purposely went through the Limma method earlier to compare results and practice the process of using Limma, but now we will only consider results from the Quasi Likelihood method for the rest of our analysis.


Now we will create a volcano plot to show the amounts of differentially expressed genes. Non-signiciant genes in grey, up regualated genes in red and down regualated genes in blue. The gene of interest, NEDD9, is colored black to highlight.

col <- vector(mode="character", length = nrow(dataFilteredAnnot))
for (i in 1:nrow(qlfOutputHits$table)) {
  if (qlfOutputHits$table$logFC[i] < 0 && qlfOutputHits$table$FDR[i] < 0.05) {
    col[i] <- "blue"
  } else if (qlfOutputHits$table$logFC[i] > 0 && qlfOutputHits$table$FDR[i] < 0.05) {
    col[i] <- "red"
  } else {
    col[i] <- "grey"
  }
}
  
col[which(row.names(qlfOutputHits$table) == "ENSG00000111859")] <- "black"

plot(qlfOutputHits$table$logFC,
     -log(qlfOutputHits$table$FDR, base=10),
     col = col,
     xlab = "logFC",
     ylab ="-log(FDR)", 
     main="Volcano Plot of Differentially Expressed Genes")

legend(-6.9, 68, legend=c("Up Regulated Genes", "Down Regulated genes", "Non-significant", "NEDD9"), fill = c("blue", "red", "grey", "black"), cex = 0.5)
Fig. 2: Volcano Plot of differentially expressed genes with FDR

Fig. 2: Volcano Plot of differentially expressed genes with FDR


Note that NEDD9 is in the up-regulated gene because I ran the analysis backwards NEDD9 silenced (shNEDD9 condition) vs Control (shNTC condition) instead of control vs silenced.


Heatmap of Top Hits

We will portray a heatmap of the top hits with the following:

if(!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}

if(!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}

library(circlize)
library(ComplexHeatmap)

hmMatrix <- dataFilteredAnnot[, 3:ncol(dataFilteredAnnot)]
rownames(hmMatrix) <- dataFilteredAnnot$ensembl_gene_id

topHits <- rownames(qlfOutputHits$table)[qlfOutputHits$table$FDR < 0.05]
hmTopHits <- t(scale(t(hmMatrix[which(rownames(hmMatrix) %in% topHits), ])))

if (min(hmTopHits) == 0){
    hmCol = colorRamp2(c( 0, max(hmTopHits)), 
                             c( "white", "red"))
  } else {
    hmcol = colorRamp2(c(min(hmTopHits), 0, max(hmTopHits)), c("blue", "white", "red"))
  }

Heatmap(as.matrix(hmTopHits),
        cluster_rows = TRUE, show_row_dend = TRUE,
        cluster_columns = FALSE, show_column_dend = FALSE,
        col=hmcol, show_column_names = TRUE, 
        show_row_names = FALSE, show_heatmap_legend = TRUE, 
        column_title = "Heatmap of Top Hits")
Fig. 3: Heatmap of Top Hits

Fig. 3: Heatmap of Top Hits


From the heatmap, we can notice that the conditions do cluster together (shNEDD9 clustered on the left and shNTC clustered on the right). I believe this is due to the experimental design, when NEDD9 is suppressed the significantly differentiated genes will be up/down regulated in a similar way in the same NEDD9 suppressed condition (Same idea for the control, since NEDD9 is not suppressed, the genes up/down regulated will be similar for the control conditions).


Thresholded Over-Representation Analysis

Creating Thresholded List

We will create a thresholded lists of genes.

upRegGenes <- row.names(qlfOutputHits$table)[which(qlfOutputHits$table$FDR < 0.05 & qlfOutputHits$table$logFC > 0)]
downRegGenes <- row.names(qlfOutputHits$table)[which(qlfOutputHits$table$FDR < 0.05 & qlfOutputHits$table$logFC < 0)]
allGenes <- c(upRegGenes, downRegGenes)


Now we save the list of genes in seperate files (table) for later access.

write.table(x=allGenes,
            "./allGenes.txt",sep='\t',
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=upRegGenes,
            "./upRegGenes.txt",sep='\t',
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downRegGenes,
            "./downRegGenes.txt",sep='\t',
            row.names = FALSE,col.names = FALSE,quote = FALSE)


g:Profiler

For over-representation analysis, we will use the g:Profiler method as it allows access to multiple annotation data sources such as Go biological process, Reactom and etc within the method. There is also a R package named “gprofiler2” which is a convenient way to run g:profiler within R (we will be doing this).

Let’s begin with loading the “gprofiler2” library.

if (!requireNamespace("gprofiler2", quietly = TRUE)){
  install.packages("gprofiler2")
}
library(gprofiler2)

Annotation Data and versions

For annotation data, we will be using the Go biological process (GO-BP) and Reactome. GO-BP was chosen for the relevant biological processes of the significantly differentiated genes. Reactome was also chosen for relevent biological pathways of the significantly differentiated genes (Reactome and WikiPathways are both databases for biological pathways so I decided to only choose Reactome for this assignment). The combination of GO-BP and Reactome seems to be most reasonable giving annotation data of the biological processes and pathways.


The version for GO-BP:

version <- gprofiler2::get_version_info(organism = "hsapiens")
goBPVersion <- version$sources$`GO-BP`$version
goBPVersion
## NULL


The version for Reactome:

reactomeVersion <- version$sources$REAC$version
reactomeVersion
## [1] "annotations: BioMart\nclasses: 2022-1-3"


From above, we can notice that the used GO-BP version is the 2021-12-15 version. While the Reactome version is the 2022-01-03 version.

Running g:Profiler on all differentially expressed genes

We will first run a g:Profiler query for all significantly differentiated genes with the GO-BP and Reactome annotations.

allGenesResults <- gost(
  allGenes,
  organism = "hsapiens",
  ordered_query = FALSE,
  multi_query = FALSE,
  significant = FALSE,
  exclude_iea = FALSE,
  measure_underrepresentation = FALSE,
  evcodes = FALSE,
  user_threshold = 0.05,
  correction_method = c("fdr"),
  sources = c("GO-BP", "REAC"),
  as_short_link = FALSE
)


Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 14565 gene sets returned from GO-BP and Reactome.

length(allGenesResults$result$term_id)
## [1] 2122


Now, we can get the top GO-BP results for querying all genes:

kable(head(allGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 5: Top GO-BP Results for All Genes")
Table 5: Top GO-BP Results for All Genes
term_name term_id p_value
Cell Cycle REAC:R-HSA-1640170 0
Translation REAC:R-HSA-72766 0
Cell Cycle, Mitotic REAC:R-HSA-69278 0
Metabolism of RNA REAC:R-HSA-8953854 0
Cellular responses to stress REAC:R-HSA-2262752 0
Cellular responses to stimuli REAC:R-HSA-8953897 0


From the GO-BP results of all genes, it seems like the terms are related to cellular metabolic processes (macromolecules, protein and organonitrogen). Since the NEDD9 Gene is a member of the CRK-associated substrates family which play a role as a adhesion docking molecule, the top results of GO-BP (cellular metabolic processes) seems to make sense.


We also can get the top Reactome results for querying all genes:

kable(head(allGenesResults$result[allGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 6: Top Reactome Results for All Genes")
Table 6: Top Reactome Results for All Genes
term_name source term_id p_value
Cell Cycle REAC REAC:R-HSA-1640170 0
Translation REAC REAC:R-HSA-72766 0
Cell Cycle, Mitotic REAC REAC:R-HSA-69278 0
Metabolism of RNA REAC REAC:R-HSA-8953854 0
Cellular responses to stress REAC REAC:R-HSA-2262752 0
Cellular responses to stimuli REAC REAC:R-HSA-8953897 0


From Entrex Gene Summary of NEDD9, it is known that NEDD9 plays a role in apoptosis and the cell cycle. Thus, the results from Reactome makes a lot of sense with the Cell Cycle term as the top result. I’m guessing the Cellular Response to stress/stimuli term is regarding apoptosis/cell cycle (but I am not too sure).


Running g:Profiler on Up-Regulated Genes

Now we will run the same query but only with the up-regulated genes.

upRegGenesResults <- gost(
  upRegGenes,
  organism = "hsapiens",
  ordered_query = FALSE,
  multi_query = FALSE,
  significant = FALSE,
  exclude_iea = FALSE,
  measure_underrepresentation = FALSE,
  evcodes = FALSE,
  user_threshold = 0.05,
  correction_method = c("fdr"),
  sources = c("GO-BP", "REAC"),
  as_short_link = FALSE
)


Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 12087 gene sets returned from GO-BP and Reactome.

length(upRegGenesResults$result$term_id)
## [1] 1853

Now, we can get the top GO-BP results for querying up-regulated genes:

kable(head(upRegGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 7: Top GO-BP Results for Up-Regulated Genes")
Table 7: Top GO-BP Results for Up-Regulated Genes
term_name term_id p_value
Signaling by Rho GTPases, Miro GTPases and RHOBTB3 REAC:R-HSA-9716542 0
Membrane Trafficking REAC:R-HSA-199991 0
Signaling by Rho GTPases REAC:R-HSA-194315 0
RHO GTPase cycle REAC:R-HSA-9012999 0
Signal Transduction REAC:R-HSA-162582 0
Vesicle-mediated transport REAC:R-HSA-5653656 0


The first thing we notice is the top terms are different from the top terms of the “all genes” GO-BP query. The up-regulated genes seem to be more focused on localization and development process.


We also can get the top Reactome results for querying up-regulated genes:

kable(head(upRegGenesResults$result[upRegGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 8: Top Reactome Results for Up-Regulated Genes")
Table 8: Top Reactome Results for Up-Regulated Genes
term_name source term_id p_value
Signaling by Rho GTPases, Miro GTPases and RHOBTB3 REAC REAC:R-HSA-9716542 0
Membrane Trafficking REAC REAC:R-HSA-199991 0
Signaling by Rho GTPases REAC REAC:R-HSA-194315 0
RHO GTPase cycle REAC REAC:R-HSA-9012999 0
Signal Transduction REAC REAC:R-HSA-162582 0
Vesicle-mediated transport REAC REAC:R-HSA-5653656 0


We also notice here that the top terms are different from the top terms of the “all genes” Reactome query. The up-regualted genes (genes more expressed in control since we did the analysis in reverse like mentioned earlier) seem to be more focused on membrane trafficking / signaling while “all genes” query is focusing more on cell cycle.

Running g:Profiler on Down-Regulated Genes

Now we will run the same query but only with the down-regulated genes.

downRegGenesResults <- gost(
  downRegGenes,
  organism = "hsapiens",
  ordered_query = FALSE,
  multi_query = FALSE,
  significant = FALSE,
  exclude_iea = FALSE,
  measure_underrepresentation = FALSE,
  evcodes = FALSE,
  user_threshold = 0.05,
  correction_method = c("fdr"),
  sources = c("GO-BP", "REAC"),
  as_short_link = FALSE
)


Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 11299 gene sets returned from GO-BP and Reactome.

length(downRegGenesResults$result$term_id)
## [1] 1717


Now, we can get the top GO-BP results for querying down-regulated genes:

kable(head(downRegGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 9: Top GO-BP Results for Down-Regulated Genes")
Table 9: Top GO-BP Results for Down-Regulated Genes
term_name term_id p_value
Translation REAC:R-HSA-72766 0
Cell Cycle REAC:R-HSA-1640170 0
Metabolism of RNA REAC:R-HSA-8953854 0
Cell Cycle, Mitotic REAC:R-HSA-69278 0
rRNA processing REAC:R-HSA-72312 0
Cap-dependent Translation Initiation REAC:R-HSA-72737 0


Interestingly, the GO-BP results from down-regulated genes are similar to the “all genes” results with both results focusing on cellular macromolecule processes/metabolic processes.


We also can get the top Reactome results for querying down-regulated genes:

kable(head(downRegGenesResults$result[downRegGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 10: Top Reactome Results for Down-Regulated Genes")
Table 10: Top Reactome Results for Down-Regulated Genes
term_name source term_id p_value
Translation REAC REAC:R-HSA-72766 0
Cell Cycle REAC REAC:R-HSA-1640170 0
Metabolism of RNA REAC REAC:R-HSA-8953854 0
Cell Cycle, Mitotic REAC REAC:R-HSA-69278 0
rRNA processing REAC REAC:R-HSA-72312 0
Cap-dependent Translation Initiation REAC REAC:R-HSA-72737 0


The Reactome results from down-regulated genes are also similar to the “all genes” results with both results focusing on cell cycle and translation.


A3: Data set Pathway and Network Analysis

Important note for further interpretation

As mentioned also in A2, note that NEDD9 is in the up-regulated gene because I ran the analysis backwards NEDD9 silenced (shNEDD9 condition) vs Control (shNTC condition) instead of control vs silenced. This is important for interpretation since the positive enrichment will mean more expressed in the NEDD silenced condition and negative enrichment will mean less expressed in the NEDD silenced condition. Essentially, the results should be interpretated backwards from the conventional Control vs Silenced model.

Non-thresholded Gene set Enrichment Analysis

Creating Ranked Genes List

First, we will create a ranked set of genes according to the GSEA format from our ranked genes list in A2 as we did not create one in A2.

qlfOutputHits$table[, "rank"] <- -log(qlfOutputHits$table$PValue, base = 10) * sign(qlfOutputHits$table$logFC)
qlfOutputHits$table <- merge(dataIdConversion, qlfOutputHits$table, by.x = 1, by.y = 0, all.y=TRUE)
qlfOutputHits$table <- qlfOutputHits$table[order(qlfOutputHits$table$rank, decreasing = TRUE),]

ranked_genes_list <- data.frame(GeneName = qlfOutputHits$table$hgnc_symbol, rank = qlfOutputHits$table[,"rank"])

write.table(x=ranked_genes_list, file="ranked_genes_list.rnk",sep = "\t", row.names = FALSE, quote = FALSE)

kable(head(ranked_genes_list), type = "html", row.names = FALSE, caption = "Table 1: Ranked Genes List for GSEA")
Table 1: Ranked Genes List for GSEA
GeneName rank
REG4 69.65442
IGFBP3 63.90161
PLA2G2A 63.67974
WDR7 57.13595
THBS1 54.84997
ADAMTS1 54.82225

Download geneset database from Bader Lab

Now we will download the required geneset from Bader Lab for GSEA input. Note that we will use HUGO gene symbols that we matched in A1 for our gene identifier as GSEA will expect HUGO symbols because our geneset database is annotated in HUGO symbols.

if (!requireNamespace("RCurl", quietly = TRUE)){
  install.packages("RCurl")
}
library(RCurl)

gmt_url <- "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
filenames <- getURL(gmt_url)
tc <- textConnection(filenames)
contents <- readLines(tc)
close(tc)
rx <- gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea_.*.)(.gmt)(?=\">)",
              contents, perl = TRUE)
gmt_file <- unlist(regmatches(contents, rx))
if (!file.exists(paste("data/", gmt_file, sep = ""))){
  dest <- file.path(paste(getwd(), "data", sep = ""), gmt_file)
  download.file(paste(gmt_url,gmt_file,sep=""), destfile=dest)
}

Running GSEA

For the Non-thresholded Gene set Enrichment Analysis, I will be using the GSEA Preranked method of GSEA software version 4.2.3. I will be running GSEA 4.2.3 on my local laptop (M1 Mac) with the following parameters:

  • Required Fields:
    • Gene set database: Bader Lab Geneset from April 2022
    • Number of Permutations: 1000
    • Ranked list: ranked_genes_list
    • Collapse/ Remap: No collapse
  • Basic Fields:
    • Max size: 200
    • Min size: 15

GSEA Results

Missing GeneName Warnings

As there were missing HUGO symbols for geneIds (check A1 Mapping Data Section for detail), GSEA gave a warning saying 75 genes were ignored due to N/A values. I decided to move on with the warnings as these genes were only a very small part of the data, but I will take a note of this for our future analysis. Please let me know if there are better methods to handle this.

Positive Enrichment Score Results Summary

Table 2 shows a summary of the top gene sets for positive enrichment from the GSEA analysis. From the result, we see the top positive enrichment results are related to exocytosis, extracellular structure organization and regulation of exocytosis.

gsea_pos_results <- read.delim("gsea_report_for_na_pos.tsv", row.names = NULL)
gsea_pos_results <- data.frame(name = gsea_pos_results$NAME, size = gsea_pos_results$SIZE, ES = gsea_pos_results$ES, NES = gsea_pos_results$NES, NOM_p_val = gsea_pos_results$NOM.p.val, FDR_q_val = gsea_pos_results$FDR.q.val, FWER_p_val = gsea_pos_results$FWER.p.val, rank_at_max = gsea_pos_results$RANK.AT.MAX, leading_edge = gsea_pos_results$LEADING.EDGE)
kable(head(gsea_pos_results), type = "html", row.names = FALSE, caption = "Table 2: GSEA Positive Enrichment Results")
Table 2: GSEA Positive Enrichment Results
name size ES NES NOM_p_val FDR_q_val FWER_p_val rank_at_max leading_edge
EXOCYTOSIS%GOBP%GO:0006887 130 0.7220704 1.949632 0 0.0023131 0.002 2168 tags=33%, list=16%, signal=39%
SENSORY PERCEPTION%REACTOME%R-HSA-9709957.3 118 0.7260143 1.933759 0 0.0011566 0.002 1426 tags=29%, list=10%, signal=32%
EXTRACELLULAR STRUCTURE ORGANIZATION%GOBP%GO:0043062 111 0.7326065 1.926554 0 0.0011503 0.003 1228 tags=23%, list=9%, signal=25%
EXTRACELLULAR MATRIX ORGANIZATION%GOBP%GO:0030198 110 0.7328299 1.919529 0 0.0008627 0.003 1228 tags=24%, list=9%, signal=26%
REGULATION OF EXOCYTOSIS%GOBP%GO:0017157 121 0.7286935 1.918481 0 0.0006902 0.003 1733 tags=26%, list=12%, signal=29%
EXTERNAL ENCAPSULATING STRUCTURE ORGANIZATION%GOBP%GO:0045229 112 0.7301717 1.898211 0 0.0007655 0.004 1228 tags=23%, list=9%, signal=25%

Negative Enrinchment Score Results Summary

Table 3 shows a summary of the top gene sets for negative enrichment from the GSEA analysis. From the result, we see the top negative enrichment results are related to various translations and translation initiation.

gsea_neg_results <- read.delim("gsea_report_for_na_neg.tsv", row.names = NULL)
gsea_neg_results <- data.frame(name = gsea_neg_results$NAME, size = gsea_neg_results$SIZE, ES = gsea_neg_results$ES, NES = gsea_neg_results$NES, NOM_p_val = gsea_neg_results$NOM.p.val, FDR_q_val = gsea_neg_results$FDR.q.val, FWER_p_val = gsea_neg_results$FWER.p.val, rank_at_max = gsea_neg_results$RANK.AT.MAX, leading_edge = gsea_neg_results$LEADING.EDGE)
kable(head(gsea_neg_results), type = "html", row.names = FALSE, caption = "Table 3: GSEA Negative Enrichment Results")
Table 3: GSEA Negative Enrichment Results
name size ES NES NOM_p_val FDR_q_val FWER_p_val rank_at_max leading_edge
CAP-DEPENDENT TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72737 115 -0.9059194 -2.385736 0 0 0 823 tags=63%, list=6%, signal=66%
EUKARYOTIC TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72613 115 -0.9059194 -2.370838 0 0 0 823 tags=63%, list=6%, signal=66%
CYTOPLASMIC TRANSLATION%GOBP%GO:0002181 120 -0.9004626 -2.346311 0 0 0 823 tags=60%, list=6%, signal=63%
NONSENSE-MEDIATED DECAY (NMD)%REACTOME%R-HSA-927802.2 109 -0.9041099 -2.336048 0 0 0 651 tags=55%, list=5%, signal=57%
GTP HYDROLYSIS AND JOINING OF THE 60S RIBOSOMAL SUBUNIT%REACTOME%R-HSA-72706.2 108 -0.9082190 -2.334970 0 0 0 970 tags=68%, list=7%, signal=72%
HALLMARK_E2F_TARGETS%MSIGDB_C2%HALLMARK_E2F_TARGETS 152 -0.8626322 -2.331709 0 0 0 1311 tags=72%, list=9%, signal=79%

Thresholded Analysis results vs Non-thresholded Analysis results

For the positive enrichment results, we see the top terms of the thresholded analysis results are related to signaling and Membrane Trafficking; while top terms of the non-thresholded analysis results are related to exocytosis, extracellular structure organization. There is minimal similarity between the two results as the top results are mostly different between the Thresholded and Non-thresholded analysis.


However, the negative enrichment results from the thresholded and non-thresholded analysis are quite similar. Both thresholded and non-thresholded analysis top results are related to translation. The main difference I noticed was that the non-thresholded analysis results were more specific (ex. CAP-dependent translation) while the thresholded analysis result only mentions “translation.”


The comparison doesn’t seem to be straightforward comparison as non-thresholded takes account of all genes (hence non-thresholded) while thresholded analysis only takes genes that pass the threshold. In addition, the 75 genes ignored in GSEA make it seem like a not straightforward comparison as these genes were used in thresholded analysis (as g:profiler used gene ids and not the mapped hugo symbols).

Data set Pathway and Network Analysis

Creating Enrichment Map and Used Parameters

Cytoscape 3.9.1 (Shannon et al. 2003) with the Enrichment Map plugin 3.3.3 (Merico et al. 2010) was used for creating the enrichment map. The following parameters/thresholds were used to creat the enrichment map:

  • P-value: 0.005
  • FDR Q-value: 0.075
  • Overlap (edge): 0.5

I decided to use these thresholds as it was recommended in the enrichment map manual for moderately conservative thresholds.

The enrichment map before manual layout:

Figure 1. Enrichment Map before Manual Layout

Annotating the Network

For annotating the network I used the AutoAnnotate plugin (Kucera et al. 2016) to cluster and annotate the network with the following parameters:

  • Quick Start Annotation
    • Annotate entire network
    • Layout network to prevent cluster overlap
    • Label column: GS_DESCR

I annotated the entire network to not miss out on some nodes, but I could always manually hide some small clusters that I don’t need later on for a clearer image.

Publication ready figure

The publication ready figure was created with the publication-ready option with all annotations:

Figure 2. Publication Ready Enrichment Map

Theme Network

The network was collapsed into a theme network using the AutoAnnotate Create Summary Network functionality with all nodes/clusters collapsed. Theme network:


Figure 3. Theme Network of all nodes/clusters


Main themes include translation, signaling and mitosis/meiosis cycles. The main themes don’t exactly fit with the model of over-expression NEDD9 causing prostate cancer progression and metastasis. However, themes such as signaling, cell cycles and translation may be part of promoting cancer growth and metastasis. Lastly, I don’t notice any novel pathways/themes present in the theme network.

Interpretation

Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods.

The enrichment results do not directly support the conclusions or mechanism discussed in the original paper, but do show some relevance to the conclusions. The original paper states that NEDD9 gene is suspected to play a role in tumor growth / metastasis for prostate cancer(Han et al. 2021). From the enrichment analysis, we see results related to translation, cell cycle regulation and signaling pathways that may be relevant to prostate cancer growth and metastasis. However, I believe this is not enough to completely conclude that the enrichment results really support the conclusions discussed in the paper.


Comparing the GSEA results with the thresholded analysis results from A2, there is high similarity between the negative enrichment results and minal similarity between the positive enrichment results. Both thresholded and non-thresholded negative enrichment results show translation as the top results showing similarity. However, the positive enrichment results differ between the non-tresholded and thresholded results as thresholded top results show signalling and membrane trafficking, meanwhile non-thresholded top results show exocytosis and extracellular structure organization.


Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?

From doing more research, I found that the mechanism of how NEDD9 regulates metastasis and cancer growth is largely unknown (Shagisultanova et al. 2015) which is the reason why I can’t completely conclude that the enrichment results support the conclusion that NEDD9 promotes metastasis/cell invasion. However, the paper NEDD9/Arf6-dependent endocytic trafficking of matrix metalloproteinase 14 (Shagisultanova et al. 2015), discusses the role of NEDD9 in redistribution of MMP14 to the cell surface (extra-cellular structure organization) and trafficking which support some main results (signalling, membrane trafffiking and extracellular structure organization) from the enrichment analysis.

Post Analysis: Drug Target

Geneset and Parameters

I will be adding a post analysis to the main network using the drug bank gene set to find potential drugs related to our network. I decided to add drugs for post-analysis to see if any cancer/metastasis related drugs will come out as top hits to support conclusions from the original paper. In addition, I was generally curious if any drug related to the target NEDD9 gene will come up as a top result.


The following gene set and parameters were used for adding the post analysis:

  • Gene set
    • The Bader lab Human_DrugBank_approved_symbol.gmt gene set from April 2022 was used for post analysis to find approved drugs that may be relevent to our analysis.
    • Loaded from web.
  • Edge Weight Parameter
    • Mann-Whitney (One Sided Greater)
    • Cutoff 0.05

Post Analysis Results

A total of 161 gene sets out of 1972 gene sets passed the Mann-Whitney (One Sided Greater) cutoff of 0.05. The top hits are shown in Figure 4. below with fostamatinib as the top result.

Figure 4. DrugBank Top Passed Gene Set list


For the network, I only added the top hit Fostamatinib in the network for clarity and because most drugs were not very relevent with the mechanism dicussed in the paper: Susceptibility-associated genetic variation in NEDD9 contributes to prostate cancer initiation and progression (Han et al. 2021). The snapshot of the network with added post analysis is shown in Figure 5.

Figure 5. Network with Post Analysis Top Hit Fostamatinib


Taking a more detailed look into Fostamatinib in DrugBank, Fostamatinib is a tyrosine kinase inhibitor used to treat chronic immune thrombocytopenia (Wishart et al. 2018). Although the original paper (Han et al. 2021) discusses the potential role of the NEDD9 gene in prostate cancer and not chronic immune thrombocytopenia, it is interesting to see Fostamatinib as the top result as sources such as sources such as GeneCards (Szklarczyk et al. 2015) show NEDD9’s role in chemokine-induced T cell migration and T cell receptor (TCR)–mediated integrin activation.

The post-analysis did not show support for the mechanism/conclusions made in the original paper, but did support other NEDD9 mechanisms discussed in papers such as Preclinical and clinical studies of the NEDD9 scaffold protein in cancer and other diseases (Szklarczyk et al. 2015). In addition, the post-analysis result aligns with the NEDD9 gene description from GeneCards (Szklarczyk et al. 2015).

Used Packages/Applications for Analysis

Packages/Applications used: Biomanager(Cattley and Arthur 2007), GEOmetadb(GEOquery et al., n.d.), BioMart(Durinck et al. 2005), knitR(Xie 2018), Circlize(Gu et al. 2014), ComplexHeatmaps(Gu, Eils, and Schlesner 2016), Limma(Ritchie et al. 2015), edgeR(Robinson, McCarthy, and Smyth 2010), GSEA(Subramanian et al. 2007), cytoscape(Shannon et al. 2003), RCurl(Lang 2007).

Citations

Cattley, Sonia, and Jonathan W Arthur. 2007. “BioManager: The Use of a Bioinformatics Web Application as a Teaching Tool in Undergraduate Bioinformatics Training.” Briefings in Bioinformatics 8 (6): 457–65.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40.
GEOquery, Depends, RSQLite Author Jack Zhu, Sean Davis, and Maintainer Jack Zhu. n.d. “Package ‘GEOmetadb’.”
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics 32 (18): 2847–49.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics 30 (19): 2811–12.
Han, Dong, Jude N Owiredu, Bridget M Healy, Muqing Li, Maryam Labaf, Jocelyn S Steinfeld, Susan Patalano, et al. 2021. “Susceptibility-Associated Genetic Variation in Nedd9 Contributes to Prostate Cancer Initiation and Progression.” Cancer Research 81 (14): 3766–76.
Kucera, Mike, Ruth Isserlin, Arkady Arkhangorodsky, and Gary D Bader. 2016. “AutoAnnotate: A Cytoscape App for Summarizing Networks with Semantic Annotations.” F1000Research 5.
Lang, Duncan Temple. 2007. “R as a Web Client–the RCurl Package.” Journal of Statistical Software 10 (2).
Merico, Daniele, Ruth Isserlin, Oliver Stueker, Andrew Emili, and Gary D Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” PloS One 5 (11): e13984.
Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
Shagisultanova, Elena, Anna V Gaponova, Rashid Gabbasov, Emmanuelle Nicolas, and Erica A Golemis. 2015. “Preclinical and Clinical Studies of the Nedd9 Scaffold Protein in Cancer and Other Diseases.” Gene 567 (1): 1–11.
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504.
Subramanian, Aravind, Heidi Kuehn, Joshua Gould, Pablo Tamayo, and Jill P Mesirov. 2007. “GSEA-p: A Desktop Application for Gene Set Enrichment Analysis.” Bioinformatics 23 (23): 3251–53.
Szklarczyk, Damian, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, et al. 2015. “STRING V10: Protein–Protein Interaction Networks, Integrated over the Tree of Life.” Nucleic Acids Research 43 (D1): D447–52.
Wishart, David S, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, et al. 2018. “DrugBank 5.0: A Major Update to the DrugBank Database for 2018.” Nucleic Acids Research 46 (D1): D1074–82.
Xie, Yihui. 2018. “Knitr: A Comprehensive Tool for Reproducible Research in r.” In Implementing Reproducible Research, 3–31. Chapman; Hall/CRC.